2.5 2D distributions

2.5.1 Rectangular binning in plotly.js

The plotly package provides two functions for displaying rectangular bins: add_heatmap() and add_histogram2d(). For numeric data, add_heatmap() is a 2D analog of add_bars() (bins must be pre-computed), and add_histogram2d() is a 2D analog of add_histogram() (bins can be computed in the browser). Thus, I recommend add_histogram2d() for exploratory purposes, since you don’t have to think about how to perform binning. It also provides a useful zsmooth attribute for effectively increasing the number of bins (currently, “best” performs a bi-linear interpolation, a type of nearest neighbors algorithm), and nbinsx/nbinsy attributes to set the number of bins in the x and/or y directions. Figure 2.25 compares three different uses of add_histogram2d(): (1) plotly.js’ default binning algorithm, (2) the default plus smoothing, and (3) setting the number of bins in the x and y directions.

p <- plot_ly(diamonds, x = ~log(carat), y = ~log(price))
subplot(
  add_histogram2d(p) %>%
    colorbar(title = "default", len = 1/3, y = 1) %>%
    layout(xaxis = list(title = "default")),
  add_histogram2d(p, zsmooth = "best") %>%
    colorbar(title = "zsmooth", len = 1/3, y = 2/3 - 0.05) %>%
    layout(xaxis = list(title = "zsmooth")),
  add_histogram2d(p, nbinsx = 60, nbinsy = 60) %>%
    colorbar(title = "nbins", len = 1/3, y = 1/3 - 0.1) %>%
    layout(xaxis = list(title = "nbins")),
  shareY = TRUE, titleX = TRUE
)

Figure 2.25: Three different uses of add_histogram2d()
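
For contrast, add_heatmap() needs the bins to be computed before plotting. The following is a minimal sketch of that workflow using equal-width bins in base R. The bin2d() helper and its nx/ny arguments are hypothetical names introduced for illustration only, and the choice of 30 bins per direction is arbitrary.

```r
library(plotly)

# Hypothetical helper: count observations in nx-by-ny equal-width rectangles
bin2d <- function(x, y, nx = 30, ny = 30) {
  bx <- seq(min(x), max(x), length.out = nx + 1)  # bin edges along x
  by <- seq(min(y), max(y), length.out = ny + 1)  # bin edges along y
  counts <- table(
    cut(x, bx, include.lowest = TRUE),
    cut(y, by, include.lowest = TRUE)
  )
  list(
    x = head(bx, -1) + diff(bx) / 2,  # bin midpoints along x
    y = head(by, -1) + diff(by) / 2,  # bin midpoints along y
    z = unname(t(unclass(counts)))    # transpose so rows of z index y
  )
}

diamonds <- ggplot2::diamonds
b <- bin2d(log(diamonds$carat), log(diamonds$price))

plot_ly(x = b$x, y = b$y, z = b$z) %>%
  add_heatmap() %>%
  colorbar(title = "count")
```

Because the counts are computed in R, any binning rule can be swapped in here, at the cost of doing the work that add_histogram2d() otherwise delegates to the browser.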

2.5.2 Rectangular binning in R

In Bars & histograms, we leveraged a number of algorithms in R for computing the “optimal” number of bins for a histogram, via hist(), and routing those results to add_bars(). There is a surprising lack of research and computational tools for the 2D analog, and among the research that does exist, solutions usually depend on characteristics of the unknown underlying distribution, so the typical approach is to assume a Gaussian form (David W. Scott 1992). Kernel density estimation is another non

Practically speaking, that assumption is not very useful, but thankfully kernel density estimation, and specifically the kde2d() function from the MASS package, provides a well-supported rule-of-thumb.

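kde_count <- function(x, y, ...) {
  # Store the number of observations now: inside with(kde, ...) below,
  # x refers to the grid points returned by kde2d(), not the data
  n_obs <- length(x)
  kde <- MASS::kde2d(x, y, ...)
  df <- with(kde, setNames(expand.grid(x, y), c("x", "y")))
  # z is a density, so scale it by the grid cell area and the number of
  # observations to get an approximate count per cell
  df$count <- with(kde, c(z) * n_obs * diff(x)[1] * diff(y)[1])
  df
}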

kd <- with(diamonds, kde_count(log(carat), log(price), n = 30))
plot_ly(kd, x = ~x, y = ~y, z = ~count) %>% 
  add_heatmap() %>%
  colorbar(title = "Number of diamonds")

Figure 2.26: 2D Density estimation via the kde2d() function
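
The rule-of-thumb in question is the normal reference bandwidth: when no bandwidth is supplied, kde2d() falls back to MASS::bandwidth.nrd() applied to each variable. The sketch below makes that default explicit and checks that passing it by hand gives the same estimate. The object names are illustrative only.

```r
diamonds <- ggplot2::diamonds
x <- log(diamonds$carat)
y <- log(diamonds$price)

# Normal reference bandwidth for each direction (the kde2d() default)
h <- c(MASS::bandwidth.nrd(x), MASS::bandwidth.nrd(y))

kde_default  <- MASS::kde2d(x, y, n = 30)
kde_explicit <- MASS::kde2d(x, y, h = h, n = 30)

# Both calls should produce the same density surface
all.equal(kde_default$z, kde_explicit$z)
```

Supplying a larger or smaller h is then a direct way to explore smoother or rougher estimates than the default.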

2.5.3 Basic heatmaps

The add_heatmap() function can also handle categorical x and y variables.

corr <- cor(diamonds[vapply(diamonds, is.numeric, logical(1))])
plot_ly(x = rownames(corr), y = colnames(corr), z = corr) %>%
  add_heatmap() %>%
  colorbar(limits = c(-1, 1))

A heatmap with categorical axes is basically a two-way table. The Titanic dataset is a four-way table that we can put into a tidy data frame using the tidy() function from the broom package.

broom::tidy(Titanic) %>%
  plot_ly(x = ~interaction(Age, Survived), y = ~interaction(Class, Sex)) %>%
  add_heatmap(z = ~Freq)
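
If tidy() does not handle table objects in the installed version of broom, or names the count column differently (an assumption worth checking with names(), since package behavior can change between releases), base R offers an equivalent route. Calling as.data.frame() on a table returns one row per cell, with the count stored in a Freq column. A minimal sketch:

```r
library(plotly)

# Base R alternative to broom::tidy(): one row per cell of the four-way table
titanic_df <- as.data.frame(Titanic)
names(titanic_df)  # Class, Sex, Age, Survived, Freq

plot_ly(
  titanic_df,
  x = ~interaction(Age, Survived),
  y = ~interaction(Class, Sex)
) %>%
  add_heatmap(z = ~Freq)
```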


#one_heatmap <- function(d) {
#  plot_ly(d, x = ~Class, y = ~Sex) %>% 
#    add_heatmap(z = ~Freq) %>%
#    add_annotations(
#      ~unique(paste(Age, Survived, sep = ", ")), x = 0.5, y = 1, 
#      yanchor = "bottom", xanchor = "middle",
#      xref = "paper", yref = "paper", showarrow = FALSE
#    )
#}
#
#tidy(Titanic) %>%
#  group_by(Age, Survived) %>%
#  do(p = one_heatmap(.)) %>%
#  subplot(
#    nrows = 2, margin = c(0.01, 0.01, 0.04, 0.04),
#    shareX = TRUE, shareY = TRUE, 
#    titleY = FALSE, titleX = FALSE
#  )

2.5.4 Contours

plot_ly(diamonds, x = ~cut, y = ~clarity) %>% 
  add_histogram2dcontour()
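
The example above uses two categorical variables, but add_histogram2dcontour() can also bin numeric variables. As a minimal sketch, the following reuses the log-transformed carat and price variables from the rectangular binning examples. The object name p_contour is illustrative only.

```r
library(plotly)

diamonds <- ggplot2::diamonds

# Contours of the binned counts for two numeric variables
p_contour <- plot_ly(diamonds, x = ~log(carat), y = ~log(price)) %>%
  add_histogram2dcontour()

p_contour
```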